1 Installations

ASURAT is available on GitHub.

devtools::install_github("keita-iida/ASURAT", upgrade = "never")

Attach necessary libraries:

library(ASURAT)
library(SingleCellExperiment)
library(SummarizedExperiment)


2 Goal

A goal of ASURAT is to cluster and characterize individual samples (cells) in terms of cell type (or disease), biological function, signaling pathway activity, and so on (see here).


3 Quick start by SingleCellExperiment objects

Having a SingleCellExperiment object (e.g., sce), one can use ASURAT by confirming the following requirements:

  • assays(sce) contains gene expression data with row and column names as variable (gene) and sample (cell), respectively,

If sce contains normalized expression data (e.g., assay(sce, "logcounts")), set assay(sce, "centered") by subtracting the data with the mean expression levels across samples (cells).

mat <- as.matrix(assay(sce, "logcounts"))
assay(sce, "centered") <- sweep(mat, 1, apply(mat, 1, mean), FUN = "-")

One may use a Seurat function Seurat::as.SingleCellExperiment() for converting Seurat objects into SingleCellExperiment ones.

Now, ready for the next step here.


4 Preprocessing

4.1 Load SingleCellExperiment objects

Load a single-cell RNA sequencing (scRNA-seq) data.

urlpath <- "https://github.com/keita-iida/ASURAT_DB/blob/main/transcriptome/"
load(url(paste0(urlpath, "pbmc_counts.rda?raw=true")))

Here, pbmc_counts is a read count table of peripheral blood mononuclear cells (PBMCs), processed by extremely conservative quality controls.

Below is a head of pbmc_counts:

pbmc_counts[1:5, 1:3]
#>            AAACCTGCAAGGTTCT-1 AAACCTGCAGGCGATA-1 AAACCTGCATGAAGTA-1
#> FO538757.2                  0                  0                  1
#> NOC2L                       0                  0                  0
#> ISG15                       0                  0                  4
#> SDF4                        0                  1                  1
#> UBE2J2                      0                  0                  0

Create SingleCellExperiment objects by inputting gene expression data.

pbmc <- SingleCellExperiment(assays = list(counts = pbmc_counts),
                             rowData = data.frame(gene = rownames(pbmc_counts)),
                             colData = data.frame(cell = colnames(pbmc_counts)))

Check data sizes.

dim(pbmc)
#> [1] 4060 2000


4.2 Control data quality

Remove variables (genes) and samples (cells) with low quality, by processing the following three steps:

  1. remove variables based on expression profiles across samples,
  2. remove samples based on the numbers of reads and nonzero expressed variables,
  3. remove variables based on the mean read counts across samples.

First of all, add metadata for both variables and samples using ASURAT function add_metadata().

The arguments are

  • sce: SingleCellExperiment object, and
  • mitochondria_symbol: a string representing for mitochondrial genes.
pbmc <- add_metadata(sce = pbmc, mitochondria_symbol = "^MT-")

One can check the results in rowData(sce) and colData(sce) slots.


4.2.1 Remove variables based on expression profiles

ASURAT function remove_variables() removes variable (gene) data such that the numbers of non-zero expressing samples (cells) are less than min_nsamples.

pbmc <- remove_variables(sce = pbmc, min_nsamples = 10)


4.2.2 Remove samples based on expression profiles

Qualities of sample (cell) data are confirmed based on proper visualization of colData(sce). ASURAT function plot_dataframe2D() shows scatter plots of two-dimensional data (see here for details).

dataframe2D <- data.frame(x = colData(pbmc)$nReads, y = colData(pbmc)$nGenes)
plot_dataframe2D(dataframe2D = dataframe2D) +
  ggplot2::labs(x = "Number of reads", y = "Number of genes") +
  ggplot2::theme_classic(base_size = 20, base_family = "Helvetica")

dataframe2D <- data.frame(x = colData(pbmc)$nReads, y = colData(pbmc)$percMT)
plot_dataframe2D(dataframe2D = dataframe2D) +
  ggplot2::labs(x = "Number of reads", y = "Perc of MT reads") +
  ggplot2::theme_classic(base_size = 20, base_family = "Helvetica")

ASURAT function remove_samples() removes sample (cell) data by setting cutoff values for the metadata.

The arguments are

  • sce: SingleCellExperiment object,
  • min_nReads and max_nReads: minimum and maximum number of reads,
  • min_nGenes and max_nGenes: minimum and maximum number of non-zero expressed genes, and
  • min_percMT and max_percMT: minimum and maximum percent of reads that map to mitochondrial genes, respectively.
pbmc <- remove_samples(sce = pbmc, min_nReads = 0, max_nReads = 1e+10,
                       min_nGenes = 0, max_nGenes = 1e+10,
                       min_percMT = NULL, max_percMT = NULL)


4.2.3 Remove variables based on the mean read counts

Qualities of variable (gene) data are confirmed based on proper visualization of rowData(sce). ASURAT function plot_dataframe2D() shows scatter plots of two-dimensional data (see here for details).

dataframe2D <- data.frame(x = 1:nrow(rowData(pbmc)),
                          y = sort(rowData(pbmc)$nSamples, decreasing = TRUE))
plot_dataframe2D(dataframe2D = dataframe2D) +
  ggplot2::labs(x = "Rank of genes", y = "Mean read counts") +
  ggplot2::theme_classic(base_size = 20, base_family = "Helvetica")

ASURAT function remove_variables_second() removes variable (gene) data such that the mean read counts across samples are less than min_meannReads.

pbmc <- remove_variables_second(sce = pbmc, min_meannReads = 0.01)


4.3 Normalize data

Normalize data using bayNorm functions (Tang et al., Bioinformatics, 2020) and log transformation with a pseudo count. Then, we center the log-normalized data by subtracting with the mean expression levels across samples (cells). The resulting normalized-and-centered data are used for downstream analyses.

Perform bayNorm() for attenuating technical biases with respect to zero inflation and variation of capture efficiencies between samples (cells).

bayout <- bayNorm::bayNorm(Data = counts(pbmc), mode_version = TRUE)
assay(pbmc, "normalized") <- bayout$Bay_out

Perform log-normalization with a pseudo count.

assay(pbmc, "logcounts") <- log(assay(pbmc, "normalized") + 1)

Center row data.

mat <- assay(pbmc, "logcounts")
assay(pbmc, "centered") <- sweep(mat, 1, apply(mat, 1, mean), FUN = "-")


5 Multifaceted sign analysis

Infer cell or disease types, biological functions, and signaling pathway activity at the single-cell level by inputting related databases.

ASURAT transforms centered read count tables to functional feature matrices, termed sign-by-sample matrices (SSMs). Using SSMs, perform unsupervised clustering of samples (cells).


5.1 Load databases

Load databases.

urlpath <- "https://github.com/keita-iida/ASURAT_DB/blob/main/genes2bioterm/"
load(url(paste0(urlpath, "20220108_human_COMSig.rda?raw=TRUE"))) # CO & MSigDB
load(url(paste0(urlpath, "20201213_human_GO_red.rda?raw=TRUE"))) # GO
load(url(paste0(urlpath, "20201213_human_KEGG.rda?raw=TRUE")))   # KEGG

Prepare correlation matrices of gene expressions.

pbmc_cormat <- cor(t(assay(pbmc, "centered")), method = "spearman")

Set gene expression data into altExp(sce).

Tips: Take care not to use a slot name “log-normalized” for altExp(sce), which may produce an error when using a Seurat (version 4.0.5) function as.Seurat() in the downstream analysis.

sname <- "logcounts"
altExp(pbmc, sname) <- SummarizedExperiment(list(counts = assay(pbmc, sname)))

Add ENTREZ Gene IDs to rowData(sce).

dictionary <- AnnotationDbi::select(org.Hs.eg.db::org.Hs.eg.db,
                                    key = rownames(pbmc),
                                    columns = "ENTREZID", keytype = "SYMBOL")
dictionary <- dictionary[!duplicated(dictionary$SYMBOL), ]
rowData(pbmc)$geneID <- dictionary$ENTREZID

Add formatted databases to metadata(sce)$sign.

pbmcs <- list(CM = pbmc, GO = pbmc, KG = pbmc)
metadata(pbmcs$CM) <- list(sign = human_COMSig[["cell"]])
metadata(pbmcs$GO) <- list(sign = human_GO[["BP"]])
metadata(pbmcs$KG) <- list(sign = human_KEGG[["pathway"]])


5.2 Create signs

ASURAT function remove_signs() redefines functional gene sets for the input database by removing genes, which are not included in rownames(sce), and further removes biological terms including too few or too many genes.

The arguments are

  • sce: SingleCellExperiment object,
  • min_ngenes: minimal number of genes> 1 (the default value is 2), and
  • max_ngenes: maximal number of genes> 1 (the default value is 1000).
pbmcs$CM <- remove_signs(sce = pbmcs$CM, min_ngenes = 2, max_ngenes = 1000)
pbmcs$GO <- remove_signs(sce = pbmcs$GO, min_ngenes = 2, max_ngenes = 1000)
pbmcs$KG <- remove_signs(sce = pbmcs$KG, min_ngenes = 2, max_ngenes = 1000)

The results are stored in metadata(sce)$sign.

ASURAT function cluster_genes() clusters functional gene sets using a correlation graph-based decomposition method, which produces strongly, variably, and weakly correlated gene sets (SCG, VCG, and WCG, respectively).

The arguments are

  • sce: SingleCellExperiment object,
  • cormat: correlation matrix of gene expressions,
  • th_posi and th_nega: threshold values of positive and negative correlation coefficients, respectively.

Tips: Empirically, typical values of th_posi and th_nega are \(0.15 \le {\rm th{\_}posi} \le 0.4\) and \(-0.4 \le {\rm th{\_}nega} \le -0.15\), but one cannot avoid trial and error for setting these values. An exhaustive parameter searching is time-consuming but helpful for obtaining “optimal” values.

set.seed(8)
pbmcs$CM <- cluster_genesets(sce = pbmcs$CM, cormat = pbmc_cormat,
                             th_posi = 0.20, th_nega = -0.20)
pbmcs$GO <- cluster_genesets(sce = pbmcs$GO, cormat = pbmc_cormat,
                             th_posi = 0.24, th_nega = -0.20)
pbmcs$KG <- cluster_genesets(sce = pbmcs$KG, cormat = pbmc_cormat,
                             th_posi = 0.20, th_nega = -0.20)

The results are stored in metadata(sce)$sign.

ASURAT function create_signs() creates signs by the following criteria:

  1. the number of genes in SCG>= min_cnt_strg (the default value is 2) and
  2. the number of genes in VCG>= min_cnt_vari (the default value is 2),

which are independently applied to SCGs and VCGs, respectively.

Tips: Empirically, typical values of min_cnt_strg and min_cnt_vari are \(2 \le {\rm min\_cnt\_strg} = {\rm min\_cnt\_vari} \le 3\), but one cannot avoid trial and error for setting these values. An exhaustive parameter searching is time-consuming but helpful for obtaining “optimal” values.

pbmcs$CM <- create_signs(sce = pbmcs$CM, min_cnt_strg = 2, min_cnt_vari = 2)
pbmcs$GO <- create_signs(sce = pbmcs$GO, min_cnt_strg = 2, min_cnt_vari = 2)
pbmcs$KG <- create_signs(sce = pbmcs$KG, min_cnt_strg = 2, min_cnt_vari = 2)

The results are stored in metadata(sce)$sign_all, where “CorrType” indicates SCG or VCG, “Corr” means the average correlation coefficients of SCG or VCG, “CorrWeak” means the average correlation coefficients of WCG, “CorrGene” means SCG or VCG, and “WeakCorrGene” means WCG. The orders of gene symbols and ENTREZ IDs, separated by “/”, are consistent.


5.3 Select useful signs

If signs have semantic similarity information, one can use ASURAT function remove_signs_redundant() for removing redundant sings using the semantic similarity matrices.

The arguments are

  • sce: SingleCellExperiment object,
  • similarity_matrix: a semantic similarity matrix,
  • threshold: a threshold value of semantic similarity, used for regarding biological terms as similar ones, and
  • keep_rareID: if TRUE, biological terms with the larger ICs are kept.

Tips: The optimal value of threshold depends on the ontology structure as well as the method for computing semantic similarity matrix.

pbmcs$GO <- remove_signs_redundant(
  sce = pbmcs$GO, similarity_matrix = human_GO$similarity_matrix$BP,
  threshold = 0.80, keep_rareID = TRUE)

The results are stored in metadata(sce)$sign_SCG, metadata(sce)$sign_VCG, metadata(sce)$sign_all, and if there exist, metadata(sce)$sign_SCG_redundant and metadata(sce)$sign_VCG_redundant.

ASURAT function remove_signs_manually() removes signs by specifying IDs (e.g., GOID:XXX) or descriptions (e.g., metabolic) using grepl(). The arguments are sce and keywords (keywords separated by |).

keywords <- "Covid19|foofoo|hogehoge"
pbmcs$KG <- remove_signs_manually(sce = pbmcs$KG, keywords = keywords)

The results are stored in metadata(sce)$sign_SCG, metadata(sce)$sign_VCG, and metadata(sce)$sign_all.

There is another ASURAT function select_signs_manually(), a counter part of remove_signs_manually(), which removes signs that do not include keywords (keywords separated by |).

keywords <- "lung|pleural|respiratory"
test <- select_signs_manually(sce = pbmcs$CM, keywords = keywords)

The results are stored in metadata(sce)$sign_SCG, metadata(sce)$sign_VCG, and metadata(sce)$sign_all.


5.4 Create sign-by-sample matrices

ASURAT function create_sce_signmatrix() creates a new SingleCellExperiment object new_sce, consisting of the following information:

  • assayNames(new_sce): counts (SSM whose entries are termed sign scores),
  • names(colData(new_sce)): nReads, nGenes, percMT,
  • names(rowData(new_sce)): ParentSignID, Description, CorrGene, etc.,
  • names(metadata(new_sce)): sign_SCG, sign_VCG, etc.,
  • altExpNames(new_sce): something if there is data in altExp(sce).

The arguments are

  • sce: SingleCellExperiment object,
  • weight_strg: weight parameter for SCG (the default value is 0.5), and
  • weight_vari: weight parameter for VCG (the default is 0.5).
pbmcs$CM <- makeSignMatrix(sce = pbmcs$CM, weight_strg = 0.5, weight_vari = 0.5)
pbmcs$GO <- makeSignMatrix(sce = pbmcs$GO, weight_strg = 0.5, weight_vari = 0.5)
pbmcs$KG <- makeSignMatrix(sce = pbmcs$KG, weight_strg = 0.5, weight_vari = 0.5)

Below are head and tail of assay(sce, "counts"):

rbind(head(assay(pbmcs$CM, "counts")[, 1:3], n = 4),
      tail(assay(pbmcs$CM, "counts")[, 1:3], n = 4))
#>              AAACCTGCAAGGTTCT-1 AAACCTGCAGGCGATA-1 AAACCTGCATGAAGTA-1
#> CL:0000018-S         0.13161770        -0.29905179         0.30696763
#> CL:0000056-S         0.08073730        -0.14019300        -0.15349383
#> CL:0000057-S        -0.14249221         0.07927276         0.16488095
#> CL:0000071-S        -0.09498821        -0.09776132         0.35532711
#> MSigID:274-V        -0.13984149        -0.13984149         0.92932504
#> MSigID:276-V         0.12602370         0.02884276        -0.07110628
#> MSigID:278-V         0.03709611        -0.04886317        -0.27079070
#> MSigID:295-V         0.40278537         0.34436096         0.11186053


## Reduce dimensions of sign-by-sample matrices {#dim_reduction} Perform t-distributed stochastic neighbor embedding.

set.seed(1)
for(i in seq_len(length(pbmcs))){
  res <- Rtsne::Rtsne(t(assay(pbmcs[[i]], "counts")), dim = 2, pca = FALSE)
  reducedDim(pbmcs[[i]], "TSNE") <- res[["Y"]]
}

Perform uniform manifold approximation and projection.

set.seed(1)
res <- umap::umap(t(assay(pbmcs$CM, "counts")), n_components = 3)
reducedDim(pbmcs$CM, "UMAP") <- res[["layout"]]

The results can be visualized by ASURAT functions plot_dataframe2D() or plot_dataframe3D() (see here for details).

dataframe2D <- as.data.frame(reducedDim(pbmcs$CM, "TSNE"))
plot_dataframe2D(dataframe2D = dataframe2D) +
  ggplot2::labs(title = "PBMC (CO & MSigDB)", x = "tSNE_1", y = "tSNE_2") +
  ggplot2::theme_classic(base_size = 15, base_family = "Helvetica")

dataframe2D <- as.data.frame(reducedDim(pbmcs$GO, "TSNE"))
plot_dataframe2D(dataframe2D = dataframe2D) +
  ggplot2::labs(title = "PBMC (GO)", x = "tSNE_1", y = "tSNE_2") +
  ggplot2::theme_classic(base_size = 15, base_family = "Helvetica")

dataframe2D <- as.data.frame(reducedDim(pbmcs$KG, "TSNE"))
plot_dataframe2D(dataframe2D = dataframe2D) +
  ggplot2::labs(title = "PBMC (KEGG)", x = "tSNE_1", y = "tSNE_2") +
  ggplot2::theme_classic(base_size = 15, base_family = "Helvetica")

dataframe3D <- as.data.frame(reducedDim(pbmcs$CM, "UMAP")[, 1:3])
plot_dataframe3D(dataframe3D = dataframe3D, theta = 45, phi = 20,
                 title = "PBMC (CO & MSigDB)",
                 xlabel = "UMAP_1", ylabel = "UMAP_2", zlabel = "UMAP_3")


5.5 Cluster cells

5.5.1 Use Seurat functions

To date (December, 2021), one of the most useful clustering methods in scRNA-seq data analysis is a combination of a community detection algorithm and graph-based unsupervised clustering, developed in Seurat package.

In this tutorial, our strategy is as follows:

  1. convert SingleCellExperiment objects into Seurat objects (note that rowData() and colData() must have data),
  2. perform ScaleData(), RunPCA(), FindNeighbors(), and FindClusters(),
  3. convert Seurat objects into temporal SingleCellExperiment objects temp,
  4. add colData(temp)$seurat_clusters into colData(sce)$seurat_clusters.
resolutions <- c(0.15, 0.15, 0.25)
for(i in seq_len(length(pbmcs))){
  pbmc_surt <- Seurat::as.Seurat(pbmcs[[i]], counts = "counts", data = "counts")
  pbmc_surt[["SSM"]] <- Seurat::CreateAssayObject(counts = as.matrix(assay(pbmcs[[i]], "counts")))
  Seurat::DefaultAssay(pbmc_surt) <- "SSM"
  pbmc_surt <- Seurat::ScaleData(pbmc_surt, features = rownames(pbmc_surt))
  pbmc_surt <- Seurat::RunPCA(pbmc_surt, features = rownames(pbmc_surt))
  pbmc_surt <- Seurat::FindNeighbors(pbmc_surt, reduction = "pca", dims = 1:20)
  pbmc_surt <- Seurat::FindClusters(pbmc_surt, resolution = resolutions[i])
  pbmc_temp <- Seurat::as.SingleCellExperiment(pbmc_surt)
  colData(pbmcs[[i]])$seurat_clusters <- colData(pbmc_temp)$seurat_clusters
}

The results can be visualized by ASURAT functions plot_dataframe2D() or plot_dataframe3D() (see here for details).

labels <- colData(pbmcs$CM)$seurat_clusters
dataframe2D <- as.data.frame(reducedDim(pbmcs$CM, "TSNE"))
plot_dataframe2D(dataframe2D = dataframe2D, labels = labels, colors = NULL) +
  ggplot2::labs(title = "PBMC (CO & MSigDB)", x = "tSNE_1", y = "tSNE_2", color = "") +
  ggplot2::theme_classic(base_size = 15, base_family = "Helvetica") +
  ggplot2::guides(colour = ggplot2::guide_legend(override.aes = list(size = 4)))

labels <- colData(pbmcs$GO)$seurat_clusters
dataframe2D <- as.data.frame(reducedDim(pbmcs$GO, "TSNE"))
plot_dataframe2D(dataframe2D = dataframe2D, labels = labels, colors = NULL) +
  ggplot2::labs(title = "PBMC (GO)", x = "tSNE_1", y = "tSNE_2", color = "") +
  ggplot2::theme_classic(base_size = 15, base_family = "Helvetica") +
  ggplot2::guides(colour = ggplot2::guide_legend(override.aes = list(size = 4)))

labels <- colData(pbmcs$KG)$seurat_clusters
dataframe2D <- as.data.frame(reducedDim(pbmcs$KG, "TSNE"))
plot_dataframe2D(dataframe2D = dataframe2D, labels = labels, colors = NULL) +
  ggplot2::labs(title = "PBMC (KEGG)", x = "tSNE_1", y = "tSNE_2", color = "") +
  ggplot2::theme_classic(base_size = 15, base_family = "Helvetica") +
  ggplot2::guides(colour = ggplot2::guide_legend(override.aes = list(size = 4)))

dataframe3D <- as.data.frame(reducedDim(pbmcs$CM, "UMAP")[, 1:3])
labels <- colData(pbmcs$CM)$seurat_clusters
plot_dataframe3D(dataframe3D = dataframe3D, labels = labels, colors = NULL,
                 theta = 45, phi = 20, title = "PBMC (CO & MSigDB)",
                 xlabel = "UMAP_1", ylabel = "UMAP_2", zlabel = "UMAP_3")


5.5.2 Cell cycle inference using Seurat functions

If there is gene expression data in altExp(sce), one can infer cell cycle phases by using Seurat functions in the similar manner as above.

pbmc_surt <- Seurat::as.Seurat(pbmcs$CM, counts = "counts", data = "counts")
pbmc_surt[["GEM"]] <- Seurat::CreateAssayObject(counts = as.matrix(assay(altExp(pbmcs$CM), "counts")))
Seurat::DefaultAssay(pbmc_surt) <- "GEM"
pbmc_surt <- Seurat::ScaleData(pbmc_surt, features = rownames(pbmc_surt))
pbmc_surt <- Seurat::RunPCA(pbmc_surt, features = rownames(pbmc_surt))
pbmc_surt <- Seurat::CellCycleScoring(pbmc_surt,
                                      s.features = Seurat::cc.genes$s.genes,
                                      g2m.features = Seurat::cc.genes$g2m.genes)
pbmc_temp <- Seurat::as.SingleCellExperiment(pbmc_surt)
colData(pbmcs$CM)$Phase <- colData(pbmc_temp)$Phase


5.6 Investigate significant signs

Significant signs are analogous to differentially expressed genes but bear biological meanings. Note that naïve usages of statistical tests should be avoided because the row vectors of SSMs are centered.

Instead, ASURAT function compute_sepI_all() computes separation indices for each cluster against the others. Briefly, a separation index “sepI”, ranging from -1 to 1, is a nonparametric measure of significance of a given sign score for a given subpopulation. The larger (resp. smaller) sepI is, the more reliable the sign is as a positive (resp. negative) marker for the cluster.

The arguments are

  • sce: SingleCellExperiment object and
  • labels: a vector of labels of all the samples.
for(i in seq_len(length(pbmcs))){
  labels <- colData(pbmcs[[i]])$seurat_clusters
  pbmcs[[i]] <- compute_sepI_all(sce = pbmcs[[i]], labels = labels,
                                 nrand_samples = floor(0.5 * ncol(pbmcs[[i]])))
}

ASURAT function plot_violin() shows violin plots showing sign score distributions (see here for details).

vname <- "MSigID:92-S"
pbmc_sub <- pbmcs$CM[rownames(pbmcs$CM) %in% vname, ]
labels <- colData(pbmc_sub)$seurat_clusters
dataframe1D <- as.data.frame(t(assay(pbmc_sub, "counts")))
plot_violin(dataframe1D = dataframe1D, labels = labels, colors = NULL) +
  ggplot2::labs(title = paste0(vname, "\n", "B cell (CD79A, CD22, ...)"),
                x = "Cluster (CO & MSigDB)", y = "Sign score", fill = "Cluster") +
  ggplot2::theme_classic(base_size = 20, base_family = "Helvetica")


5.7 Investigate significant genes

5.7.1 Use Seurat function

To date (December, 2021), one of the most useful methods of multiple statistical tests in scRNA-seq data analysis is to use a Seurat function FindAllMarkers().

If there is gene expression data in altExp(sce), one can investigate differentially expressed genes by using Seurat functions in the similar manner as described before.

pbmc_surt <- Seurat::as.Seurat(pbmcs$CM, counts = "counts", data = "counts")
#If there is "expressions" data in `altExp(sce)`, set Seurat default assay.
Seurat::DefaultAssay(pbmc_surt) <- "logcounts"
pbmc_surt <- Seurat::SetIdent(pbmc_surt, value = "seurat_clusters")
res <- Seurat::FindAllMarkers(pbmc_surt, only.pos = TRUE,
                              min.pct = 0.25, logfc.threshold = 0.25)
metadata(pbmcs$CM)$marker_genes$all <- res

ASURAT function plog_violin() shows violin plots showing gene expression distributions (see here for details).

vname <- "BANK1"
expr_sub <- altExp(pbmcs$CM, "logcounts")
expr_sub <- expr_sub[rownames(expr_sub) == vname]
labels <- colData(pbmcs$CM)$seurat_clusters
dataframe1D <- as.data.frame(t(assay(expr_sub, "counts")))
plot_violin(dataframe1D = dataframe1D, labels = labels, colors = NULL) +
  ggplot2::labs(title = "BANK1", x = "Cluster (CO & MSigDB)",
                y = "Normalized expression", fill = "Cluster") +
  ggplot2::theme_classic(base_size = 20, base_family = "Helvetica")


6 Multifaceted sign analysis

Simultaneously analyze multiple sign-by-sample matrices, which helps us characterize individual samples (cells) from multiple biological aspects.

ASURAT function plot_multiheatmaps() shows heatmaps (ComplexHeatmap object) of sign scores and gene expression levels (if there are), where rows and columns stand for sign (or gene) and sample (cell), respectively (see here for details).

First, select top significant signs and genes for the clustering results with respect to separation index and p-value, respectively.

# Significant signs
marker_signs <- list()
for(i in seq_len(length(pbmcs))){
  marker_signs[[i]] <- metadata(pbmcs[[i]])$marker_signs$all
  marker_signs[[i]] <- dplyr::group_by(marker_signs[[i]], Ident_1)
  marker_signs[[i]] <- dplyr::slice_max(marker_signs[[i]], sepI, n = 1)
}
# Significant genes
marker_genes_CM <- metadata(pbmcs$CM)$marker_genes$all
marker_genes_CM <- dplyr::group_by(marker_genes_CM, cluster)
marker_genes_CM <- dplyr::slice_min(marker_genes_CM, p_val_adj, n = 1)

Then, prepare arguments.

# ssm_list
pbmcs_sub <- list() ; ssm_list <- list()
for(i in seq_len(length(pbmcs))){
  pbmcs_sub[[i]] <- pbmcs[[i]][rownames(pbmcs[[i]]) %in% marker_signs[[i]]$SignID, ]
  ssm_list[[i]] <- assay(pbmcs_sub[[i]], "counts")
}
names(ssm_list) <- c("SSM_COMSig", "SSM_GO", "SSM_KEGG")
# gem_list
expr_sub <- altExp(pbmcs$CM, "logcounts")
expr_sub <- expr_sub[rownames(expr_sub) %in% marker_genes_CM$gene]
gem_list <- list(GeneExpr = assay(expr_sub, "counts"))
# ssmlabel_list
labels <- list() ; ssmlabel_list <- list()
for(i in seq_len(length(pbmcs))){
  labels[[i]] <- data.frame(label = colData(pbmcs_sub[[i]])$seurat_clusters)
  labels[[i]]$color <- NA
  ssmlabel_list[[i]] <- labels[[i]]
}
names(ssmlabel_list) <- c("Label_COMSig", "Label_GO", "Label_KEGG")
# gemlabel_list
label_CC <- data.frame(label = colData(pbmcs$CM)$Phase, color = NA)
gemlabel_list <- list(CellCycle = label_CC)

Finally, plot heatmaps for the selected signs and genes.

set.seed(1)
plot_multiheatmaps(ssm_list = ssm_list, gem_list = gem_list,
                   ssmlabel_list = ssmlabel_list, gemlabel_list = gemlabel_list,
                   nSamples = 100, show_row_names = TRUE, title = "PBMC")


7 Appendix. Visualize computational results

In this section, ASURAT functions for visualization are introduced.


7.1 Violin plot

ASURAT function plot_violin() shows violin plots (ggplot objects) showing sign score or gene expression distributions.

The arguments are

  • dataframe1D: a dataframe with one column,
  • labels: a vector of labels of all the samples, corresponding to colors,
  • colors: a vector of colors of all the samples, corresponding to labels; if colors = NULL, default colors are used.
vname <- "MSigID:162-S"
pbmc_sub <- pbmcs$CM[rownames(pbmcs$CM) %in% vname, ]
labels <- colData(pbmc_sub)$seurat_clusters
colors <- rainbow(length(unique(labels)))[labels]
dataframe1D <- as.data.frame(t(assay(pbmc_sub, "counts")))
plot_violin(dataframe1D = dataframe1D, labels = labels, colors = colors) +
  ggplot2::labs(title = paste0(vname, "\n", "Naive T cell (CD3E, TRAC, ...)"),
                x = "Cluster (CO & MSigDB)", y = "Sign score", fill = "Cluster") +
  ggplot2::theme_classic(base_size = 20, base_family = "Helvetica")

vname <- "BANK1"
expr_sub <- altExp(pbmcs$CM, "logcounts")
expr_sub <- expr_sub[rownames(expr_sub) == vname]
labels <- colData(pbmcs$CM)$seurat_clusters
colors <- rainbow(length(unique(labels)))[labels]
dataframe1D <- as.data.frame(t(assay(expr_sub, "counts")))
plot_violin(dataframe1D = dataframe1D, labels = labels, colors = colors) +
  ggplot2::labs(title = "PBMC", x = "Cluster (CO & MSigDB)",
                y = "Normalized\nexpression", fill = "Cluster") +
  ggplot2::theme_classic(base_size = 20, base_family = "Helvetica")


7.2 Low dimensional representation

ASURAT function plot_dataframe2D() visualizes two-dimensional data with or without labels and colors.

The arguments are

  • dataframe2D: a dataframe with two columns,
  • labels: a vector of labels of all the samples, corresponding to colors,
  • colors: a vector of colors of all the samples, corresponding to labels; if colors = NULL, default colors are used.
labels <- colData(pbmcs$CM)$seurat_clusters
colors <- rainbow(length(unique(labels)))[labels]
dataframe2D <- as.data.frame(reducedDim(pbmcs$CM, "TSNE"))
plot_dataframe2D(dataframe2D = dataframe2D, labels = labels, colors = colors) +
  ggplot2::labs(title = "PBMC (CO & MSigDB)", x = "tSNE_1", y = "tSNE_2", color = "") +
  ggplot2::theme_classic(base_size = 15, base_family = "Helvetica") +
  ggplot2::guides(colour = ggplot2::guide_legend(override.aes = list(size = 4)))

ASURAT function plot_dataframe3D() visualizes three-dimensional data with or without labels and colors.

The arguments are

  • dataframe3D: a dataframe with three columns,
  • labels: a vector of labels of all the samples, corresponding to colors,
  • colors: a vector of colors of all the samples, corresponding to labels; if colors = NULL, default colors are used,
  • theta and phi: angles in degree measure,
  • title, title_size, xlabel, ylabel, and zlabel.
dataframe3D <- as.data.frame(reducedDim(pbmcs$CM, "UMAP")[, 1:3])
labels <- colData(pbmcs$CM)$seurat_clusters
plot_dataframe3D(dataframe3D = dataframe3D, labels = labels, colors = NULL,
                 theta = 45, phi = 20, title = "PBMC (CO & MSigDB)",
                 xlabel = "UMAP_1", ylabel = "UMAP_2", zlabel = "UMAP_3")


7.3 Heatmap

ASURAT function plot_heatmaps() shows heatmaps (ComplexHeatmap objects) of sign scores and gene expression levels (if there are), where rows and columns stand for sign (or gene) and sample (cell), respectively. Here, the column dendrogram is computed by the first component of ssm_list based on ComplexHeatmap functions.

The arguments are

  • ssm_list: list of sign-by-sample matrices,
  • gem_list: list of gene-by-sample matrices,
  • ssmlabel_list: NULL or a list of dataframes of sample (cell) labels and colors; the length of the list must be as same as that of ssm_list, and the order of labels in each list must be as same as those in ssm_list,
  • gemlabel_list: NULL or a list of dataframes of sample (cell) annotations and colors; the length of the list must be as same as that of gem_list, and the order of labels in each list must be as same as those in gem_list,
  • nSamples: number of samples (cells) used for random sampling,
  • show_row_names: TRUE or FALSE: if TRUE, row names are shown unless the maximum number of rows exceeds a certain value,
  • title: Title.

Prepare arguments.

# ssm_list
ssm_list <- list(SSM_COMSig = assay(pbmcs$CM, "counts"),
                 SSM_GO     = assay(pbmcs$GO, "counts"),
                 SSM_KEGG   = assay(pbmcs$KG, "counts"))
# gem_list
gem_list <- list(GeneExpr = assay(altExp(pbmcs$CM, "logcounts"), "counts"))
# ssmlabel_list
labels <- list() ; ssmlabel_list <- list()
for(i in seq_len(length(pbmcs))){
  labels[[i]] <- data.frame(label = colData(pbmcs[[i]])$seurat_clusters)
  labels[[i]]$color <- rainbow(length(unique(labels[[i]]$label)))[labels[[i]]$label]
  ssmlabel_list[[i]] <- labels[[i]]
}
names(ssmlabel_list) <- c("Label_COMSig", "Label_GO", "Label_KEGG")
# gemlabel_list
label_CC <- data.frame(label = colData(pbmcs$CM)$Phase, color = NA)
gemlabel_list <- list(CellCycle = label_CC)

Plot heatmaps.

set.seed(1)
plot_multiheatmaps(ssm_list = ssm_list, gem_list = gem_list,
                   ssmlabel_list = ssmlabel_list, gemlabel_list = gemlabel_list,
                   nSamples = 100, show_row_names = FALSE, title = "PBMC")


8 Session information

sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
#>  [3] Biobase_2.50.0              GenomicRanges_1.42.0       
#>  [5] GenomeInfoDb_1.26.7         IRanges_2.24.1             
#>  [7] S4Vectors_0.28.1            BiocGenerics_0.36.1        
#>  [9] MatrixGenerics_1.2.1        matrixStats_0.61.0         
#> [11] ASURAT_0.3.0               
#> 
#> loaded via a namespace (and not attached):
#>   [1] snow_0.4-4             circlize_0.4.13        plyr_1.8.6            
#>   [4] igraph_1.2.7           lazyeval_0.2.2         splines_4.0.4         
#>   [7] BiocParallel_1.24.1    listenv_0.8.0          scattermore_0.7       
#>  [10] ggplot2_3.3.5          digest_0.6.28          foreach_1.5.1         
#>  [13] htmltools_0.5.2        fansi_0.5.0            magrittr_2.0.1        
#>  [16] memoise_2.0.0          tensor_1.5             cluster_2.1.2         
#>  [19] ROCR_1.0-11            limma_3.46.0           ComplexHeatmap_2.6.2  
#>  [22] globals_0.14.0         spatstat.sparse_2.0-0  askpass_1.1           
#>  [25] colorspace_2.0-2       blob_1.2.2             ggrepel_0.9.1         
#>  [28] xfun_0.27              dplyr_1.0.7            tcltk_4.0.4           
#>  [31] crayon_1.4.2           RCurl_1.98-1.5         jsonlite_1.7.2        
#>  [34] spatstat.data_2.1-0    survival_3.2-13        zoo_1.8-9             
#>  [37] iterators_1.0.13       glue_1.4.2             polyclip_1.10-0       
#>  [40] gtable_0.3.0           zlibbioc_1.36.0        XVector_0.30.0        
#>  [43] leiden_0.3.9           GetoptLong_1.0.5       DelayedArray_0.16.3   
#>  [46] future.apply_1.8.1     shape_1.4.6            abind_1.4-5           
#>  [49] scales_1.1.1           DBI_1.1.1              miniUI_0.1.1.1        
#>  [52] Rcpp_1.0.7             viridisLite_0.4.0      xtable_1.8-4          
#>  [55] clue_0.3-60            spatstat.core_2.3-1    reticulate_1.22       
#>  [58] bit_4.0.4              umap_0.2.7.0           htmlwidgets_1.5.4     
#>  [61] httr_1.4.2             RColorBrewer_1.1-2     ellipsis_0.3.2        
#>  [64] Seurat_4.0.5           ica_1.0-2              pkgconfig_2.0.3       
#>  [67] farver_2.1.0           uwot_0.1.10            deldir_1.0-6          
#>  [70] sass_0.4.0             locfit_1.5-9.4         utf8_1.2.2            
#>  [73] reshape2_1.4.4         tidyselect_1.1.1       labeling_0.4.2        
#>  [76] rlang_0.4.12           later_1.3.0            AnnotationDbi_1.52.0  
#>  [79] munsell_0.5.0          tools_4.0.4            cachem_1.0.6          
#>  [82] generics_0.1.1         RSQLite_2.2.8          ggridges_0.5.3        
#>  [85] evaluate_0.14          stringr_1.4.0          fastmap_1.1.0         
#>  [88] goftest_1.2-3          yaml_2.2.1             org.Hs.eg.db_3.12.0   
#>  [91] knitr_1.36             bit64_4.0.5            fitdistrplus_1.1-6    
#>  [94] purrr_0.3.4            RANN_2.6.1             nlme_3.1-153          
#>  [97] pbapply_1.5-0          future_1.23.0          mime_0.12             
#> [100] compiler_4.0.4         plotly_4.10.0          png_0.1-7             
#> [103] spatstat.utils_2.2-0   tibble_3.1.5           bayNorm_1.8.0         
#> [106] bslib_0.3.1            stringi_1.7.5          highr_0.9             
#> [109] RSpectra_0.16-0        plot3D_1.4             lattice_0.20-45       
#> [112] Matrix_1.3-4           vctrs_0.3.8            pillar_1.6.4          
#> [115] lifecycle_1.0.1        spatstat.geom_2.3-0    lmtest_0.9-38         
#> [118] jquerylib_0.1.4        GlobalOptions_0.1.2    RcppAnnoy_0.0.19      
#> [121] data.table_1.14.2      cowplot_1.1.1          bitops_1.0-7          
#> [124] irlba_2.3.3            httpuv_1.6.3           patchwork_1.1.1       
#> [127] R6_2.5.1               promises_1.2.0.1       gridExtra_2.3         
#> [130] KernSmooth_2.23-20     parallelly_1.28.1      codetools_0.2-18      
#> [133] MASS_7.3-54            assertthat_0.2.1       openssl_1.4.5         
#> [136] rjson_0.2.20           SeuratObject_4.0.2     sctransform_0.3.2     
#> [139] GenomeInfoDbData_1.2.4 mgcv_1.8-38            doSNOW_1.0.19         
#> [142] rpart_4.1-15           grid_4.0.4             tidyr_1.1.4           
#> [145] rmarkdown_2.11         misc3d_0.9-1           Cairo_1.5-12.2        
#> [148] Rtsne_0.15             shiny_1.7.1